Defining a gene co-expression similarity, comparing biweight midcorrelation vs Pearson correlation vs Euclidean distance
Raising the similarity matrix to the best power for scale-free topology
Using a Topological Overlap Matrix as the distance matrix
Performing hierarchical clustering with average linkage (hclust(method='average'))
Extracting clusters with a dynamic branch cutting approach from Dynamic Tree Cut: in-depth description, tests and applications
Merging similar clusters using module eigengenes, comparing original clusters vs merged clusters vs the two main clusters
Load preprocessed dataset (preprocessing code in /../Gandal/data_preprocessing.Rmd)
Keep DE genes
datExpr = datExpr %>% filter(rownames(.) %in% rownames(DE_info)[DE_info$padj<0.05])
rownames(datExpr) = datGenes$feature_id[DE_info$padj<0.05 & !is.na(DE_info$padj)]
datGenes = datGenes %>% filter(feature_id %in% rownames(DE_info)[DE_info$padj<0.05])
DE_info = DE_info %>% filter(padj<0.05)
print(paste0('Keeping ', nrow(datExpr), ' genes and ', ncol(datExpr), ' samples'))
## [1] "Keeping 3000 genes and 80 samples"
allowWGCNAThreads()
## Allowing multi-threading with up to 4 threads.
cor_mat = datExpr %>% t %>% bicor
Correcting the correlation matrix from \(s \in [-1,1]\) to \(s \in [0,1]\) using \(s_{ij}=|bw(i,j)|\)
S = abs(cor_mat)
Sigmoid function: \(a(i,j)=sigmoid(s_{ij}, \alpha, \tau_0) \equiv \frac{1}{1+e^{-\alpha(s_{ij}-\tau_0)}}\)
Power adjacency function: \(a(i,j)=power(s_{ij}, \beta) \equiv |S_{ij}|^\beta\)
We chose the power adjacency function over the sigmoid function because it has only one parameter to tune, and both methods are expected to give very similar results when the parameters are chosen with the scale-free topology criterion.
We follow the scale-free topology criterion because metabolic networks have been found to display approximately scale-free topology.
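To see the two candidate adjacency functions side by side, here is a small sketch on toy similarity values (the values and the choices of \(\beta\), \(\alpha\) and \(\tau_0\) are arbitrary, purely for illustration):

```r
# Toy similarity values in [0,1] (hypothetical, for illustration only)
s <- c(0.1, 0.3, 0.5, 0.7, 0.9)

# Power adjacency: one parameter (beta)
power_adj <- function(s, beta) s^beta

# Sigmoid adjacency: two parameters (alpha, tau0)
sigmoid_adj <- function(s, alpha, tau0) 1 / (1 + exp(-alpha * (s - tau0)))

# With suitably matched parameters, both suppress weak similarities
# while preserving strong ones
round(power_adj(s, beta = 5), 3)
round(sigmoid_adj(s, alpha = 12, tau0 = 0.6), 3)
```

Both functions map \([0,1]\) onto \([0,1]\) monotonically; the power function is simply easier to calibrate because only \(\beta\) needs to be chosen.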
best_power = datExpr %>% t %>% pickSoftThreshold(powerVector = 1:15, RsquaredCut=0.8)
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.00342 0.261 0.988 775.000 770.0000 1330.00
## 2 2 0.24500 -1.490 0.986 257.000 247.0000 636.00
## 3 3 0.58100 -2.220 0.963 98.200 90.3000 331.00
## 4 4 0.74900 -2.510 0.990 41.600 36.6000 186.00
## 5 5 0.83500 -2.560 0.989 19.200 15.8000 110.00
## 6 6 0.87300 -2.590 0.994 9.460 7.2500 67.80
## 7 7 0.90000 -2.520 0.994 4.960 3.4900 43.50
## 8 8 0.90800 -2.440 0.988 2.740 1.7600 28.80
## 9 9 0.88700 -2.380 0.951 1.590 0.9140 19.70
## 10 10 0.39100 -3.380 0.358 0.960 0.4870 13.70
## 11 11 0.38500 -3.210 0.349 0.603 0.2660 9.78
## 12 12 0.93500 -1.980 0.982 0.391 0.1490 7.11
## 13 13 0.95700 -1.830 0.985 0.262 0.0850 5.25
## 14 14 0.93300 -1.770 0.941 0.180 0.0495 3.98
## 15 15 0.94200 -1.780 0.964 0.126 0.0295 3.36
print(paste0('Best power for scale free topology: ', best_power$powerEstimate))
## [1] "Best power for scale free topology: 5"
Raise the similarity matrix to the suggested power
S_sft = S^best_power$powerEstimate
Additional considerations
With power = 5 we get a mean connectivity of about 20
mean_connectivity = matrix(0, 15, 2) %>% data.frame
colnames(mean_connectivity) = c('power','mean_connectivity')
for(i in 1:15) mean_connectivity[i,] = c(i, mean(colSums(S^i)))
ggplotly(mean_connectivity %>% ggplot(aes(power, mean_connectivity)) + ylab('mean connectivity') +
geom_vline(xintercept=best_power$powerEstimate, color='gray') + geom_point(color='#0099cc') +
scale_y_sqrt() + theme_minimal())
rm(mean_connectivity, best_power)
The slope is about 2.7 times as steep as it should be (maybe because the range of k is too narrow?)
k_df = data.frame('connectivity'=colSums(S_sft), 'bucket'=cut(colSums(S_sft), 10)) %>% group_by(bucket) %>%
dplyr::summarize(p_k=n(), k=mean(connectivity)) %>% mutate(p_k=p_k/sum(p_k))
lm(log10(p_k) ~ log10(k), data=k_df)
##
## Call:
## lm(formula = log10(p_k) ~ log10(k), data = k_df)
##
## Coefficients:
## (Intercept) log10(k)
## 2.687 -2.670
rm(k_df)
Note: the distributions are very steep; both axes are on a sqrt scale
plot_data = data.frame('n'=1:nrow(S)^2, 'S'=sort(melt(S)$value), 'S_sft'=sort(melt(S_sft)$value))
plot_data %>% filter(n%%100==0) %>% dplyr::select(S, S_sft) %>% melt %>%
ggplot(aes(value, color=variable, fill=variable)) + geom_density(alpha=0.5) +
xlab('k') + ylab('p(k)') + scale_x_sqrt() + scale_y_sqrt() + theme_minimal()
rm(plot_data)
The example given in the article shows a curve similar to this one, and the authors consider it acceptable
k_df = data.frame('connectivity'=colSums(S_sft), 'bucket'=cut(colSums(S_sft), 10)) %>% group_by(bucket) %>%
dplyr::summarize(p_k=n(), k=mean(connectivity)) %>% mutate(p_k=p_k/sum(p_k))
# k_df %>% ggplot(aes(k,p_k)) + geom_point(color='#0099cc') + geom_smooth(method='lm', se=FALSE, color='gray') +
# ylab('p(k)') + scale_x_log10() + scale_y_log10() + theme_minimal()
k_df %>% ggplot(aes(log10(k),log10(p_k))) + geom_point(color='#0099cc') + geom_smooth(method='lm', se=FALSE, color='gray') +
ylab('p(k)') + theme_minimal()
\(R^2=0.83\), consistent with the \(R^2\) we got from the pickSoftThreshold function
lm(log10(p_k) ~ log10(k), data=k_df) %>% summary
##
## Call:
## lm(formula = log10(p_k) ~ log10(k), data = k_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6508 -0.3561 0.1792 0.2841 0.4629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6865 0.6685 4.019 0.003848 **
## log10(k) -2.6703 0.3955 -6.751 0.000145 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4222 on 8 degrees of freedom
## Multiple R-squared: 0.8507, Adjusted R-squared: 0.832
## F-statistic: 45.58 on 1 and 8 DF, p-value: 0.0001449
rm(k_df)
Using the topological overlap dissimilarity measure because it has been found to result in biologically meaningful modules
The 1st quartile is already 0.9852: most gene pairs are very dissimilar
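For reference, the unsigned topological overlap between genes \(i\) and \(j\) is \(\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}}\), with \(l_{ij} = \sum_u a_{iu}a_{uj}\) and \(k_i = \sum_u a_{iu}\). A minimal base-R sketch on a toy adjacency matrix (not the optimized WGCNA implementation used below):

```r
# Toy symmetric adjacency matrix with zero diagonal (values in [0,1])
A <- matrix(c(0.0, 0.8, 0.6, 0.1,
              0.8, 0.0, 0.7, 0.2,
              0.6, 0.7, 0.0, 0.3,
              0.1, 0.2, 0.3, 0.0), nrow = 4, byrow = TRUE)

tom <- function(A) {
  diag(A) <- 0
  L <- A %*% A                 # l_ij = sum_u a_iu * a_uj (shared neighbours)
  k <- rowSums(A)              # connectivity k_i
  K <- outer(k, k, pmin)       # min(k_i, k_j)
  W <- (L + A) / (K + 1 - A)   # unsigned topological overlap
  diag(W) <- 1
  W
}

dissTOM_toy <- 1 - tom(A)      # TOM-based dissimilarity, as used below
round(dissTOM_toy, 3)
```

Two genes get a low dissimilarity not just when they are directly connected but also when they share many strong neighbours, which is what makes the measure useful for module detection.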
TOM = S_sft %>% TOMsimilarity
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
dissTOM = 1-TOM
rownames(dissTOM) = rownames(S_sft)
colnames(dissTOM) = colnames(S_sft)
dissTOM %>% melt %>% summary
## Var1 Var2 value
## ENSG00000000971: 3000 ENSG00000000971: 3000 Min. :0.0000
## ENSG00000001084: 3000 ENSG00000001084: 3000 1st Qu.:0.9852
## ENSG00000001626: 3000 ENSG00000001626: 3000 Median :0.9918
## ENSG00000002586: 3000 ENSG00000002586: 3000 Mean :0.9882
## ENSG00000002746: 3000 ENSG00000002746: 3000 3rd Qu.:0.9956
## ENSG00000002933: 3000 ENSG00000002933: 3000 Max. :0.9999
## (Other) :8982000 (Other) :8982000
Using hierarchical clustering on the TOM-based dissimilarity matrix
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster.
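The average-linkage rule can be checked by hand on a toy example: with three points on a line, the second merge happens at the mean of the between-cluster distances:

```r
# Three points on a line: 1 and 2 are close together, 3 is far away
d <- dist(c(0, 1, 10))

h <- hclust(d, method = 'average')
h$height
# First merge: points 1 and 2 at distance 1
# Second merge: cluster {1,2} with point 3 at mean(10, 9) = 9.5
```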
Instead of finding clusters, each new split of this method seems to discard outliers
dend = dissTOM %>% as.dist %>% hclust(method='average')
plot(dend, hang=0, labels=FALSE)
Instead of cutting the dendrogram into clusters at a fixed height, using a dynamic branch cutting approach taken from Dynamic Tree Cut: in-depth description, tests and applications
Two available methods:
Dynamic Tree Cut: a top-down algorithm relying only on the dendrogram and respecting the order of the clustered objects on it. This method is less sensitive to parameter choice but also less flexible (I prefer robustness over flexibility, so I'm going to use this one)
Dynamic Hybrid Cut: builds the clusters from the bottom up. In addition to information from the dendrogram, it uses dissimilarity information among the objects. It seems to rely on too many heuristics and to have too many parameters to tune
Both can be followed by a post-processing step.
Using Dynamic Tree Cut. The justification for this choice can be found in 4_tree_cutting_methods.Rmd
modules = cutreeDynamicTree(dend, deepSplit=F, minModuleSize=20)
names(modules) = labels(dend)
clusterings[['bicor']] = modules
“One can relate modules to each other by correlating the corresponding module eigengenes (Horvath et al., 2005). If two modules are highly correlated, one may want to merge them”
Calculate the “eigengenes” (1st principal component) of each module
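Conceptually, a module eigengene is the first principal component of the module's standardized expression submatrix, giving one representative value per sample. A minimal base-R sketch on simulated data (moduleEigengenes additionally aligns signs and handles missing values):

```r
set.seed(42)
n_samples <- 20; n_genes <- 8
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)  # samples x genes

eigengene <- function(expr) {
  X <- scale(expr)          # standardize each gene across samples
  prcomp(X)$x[, 1]          # sample scores on the first principal component
}

me <- eigengene(expr)
length(me)                  # one eigengene value per sample
```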
module_colors = c('gray', viridis(max(modules)))
MEs_output = datExpr %>% t %>% moduleEigengenes(colors=modules)
MEs = MEs_output$eigengenes
Merge similar modules
cor_dist = 1-cor(MEs)
dend_MEs = cor_dist %>% as.dist %>% hclust(method='average')
dend_MEs %>% as.dendrogram %>% plot(ylim=c(0, 0.6))
abline(h=0.55, col='#0099cc')
abline(h=0.228, col='#009999')
# Two main modules
tree_cut = cutree(dend_MEs, h=0.55)
top_modules = modules %>% replace(modules %in% (gsub('ME', '', names(tree_cut[tree_cut==1])) %>% as.numeric), 1) %>%
replace(modules %in% (gsub('ME', '', names(tree_cut[tree_cut==2])) %>% as.numeric), 2) %>%
replace(modules == 0, 0)
clusterings[['bicor_top_clusters']] = top_modules
# Closest modules
tree_cut = cutree(dend_MEs, h=0.228)
merged_modules = modules
n=0
for(i in sort(unique(tree_cut))){
n=n+1
merged_modules = merged_modules %>%
replace(modules %in% (gsub('ME', '', names(tree_cut[tree_cut==i])) %>% as.numeric), n)
}
merged_modules = merged_modules %>% replace(modules == 0, 0)
clusterings[['bicor_merged']] = merged_modules
merged_module_colors = c('gray', viridis(length(unique(merged_modules))))
top_module_colors = c('gray', viridis(max(top_modules)))
dend_colors = data.frame('ID'=names(modules[labels(dend)]),
'OriginalModules' = module_colors[modules[dend$order]+1],
'MergedModules' = merged_module_colors[merged_modules[dend$order]+1],
'TopModules' = top_module_colors[top_modules[dend$order]+1])
dend %>% as.dendrogram(hang=0) %>% set('labels', rep('', nrow(dissTOM))) %>% plot(ylim=c(min(dend$height),1))
colored_bars(colors=dend_colors[,-1])
rm(MEs, dend, modules, module_colors, MEs_output, top_modules, merged_modules, tree_cut,
merged_module_colors, top_module_colors, dend_colors)
cor_mat = datExpr %>% t %>% cor
Correcting the correlation matrix from \(s \in [-1,1]\) to \(s \in [0,1]\) using \(s_{ij}=|cor(i,j)|\)
S = abs(cor_mat)
We follow the scale-free topology criterion because metabolic networks have been found to display approximately scale-free topology.
best_power = datExpr %>% t %>% pickSoftThreshold(powerVector = 1:15, RsquaredCut=0.8)
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.00342 0.261 0.988 775.000 770.0000 1330.00
## 2 2 0.24500 -1.490 0.986 257.000 247.0000 636.00
## 3 3 0.58100 -2.220 0.963 98.200 90.3000 331.00
## 4 4 0.74900 -2.510 0.990 41.600 36.6000 186.00
## 5 5 0.83500 -2.560 0.989 19.200 15.8000 110.00
## 6 6 0.87300 -2.590 0.994 9.460 7.2500 67.80
## 7 7 0.90000 -2.520 0.994 4.960 3.4900 43.50
## 8 8 0.90800 -2.440 0.988 2.740 1.7600 28.80
## 9 9 0.88700 -2.380 0.951 1.590 0.9140 19.70
## 10 10 0.39100 -3.380 0.358 0.960 0.4870 13.70
## 11 11 0.38500 -3.210 0.349 0.603 0.2660 9.78
## 12 12 0.93500 -1.980 0.982 0.391 0.1490 7.11
## 13 13 0.95700 -1.830 0.985 0.262 0.0850 5.25
## 14 14 0.93300 -1.770 0.941 0.180 0.0495 3.98
## 15 15 0.94200 -1.780 0.964 0.126 0.0295 3.36
print(paste0('Best power for scale free topology: ', best_power$powerEstimate))
## [1] "Best power for scale free topology: 5"
Raise the similarity matrix to the suggested power
S_sft = S^best_power$powerEstimate
Additional considerations
With power = 5 we get a mean connectivity of about 20
mean_connectivity = matrix(0, 15, 2) %>% data.frame
colnames(mean_connectivity) = c('power','mean_connectivity')
for(i in 1:15) mean_connectivity[i,] = c(i, mean(colSums(S^i)))
ggplotly(mean_connectivity %>% ggplot(aes(power, mean_connectivity)) + ylab('mean connectivity') +
geom_vline(xintercept=best_power$powerEstimate, color='gray') + geom_point(color='#0099cc') +
scale_y_sqrt() + theme_minimal())
rm(mean_connectivity, best_power)
The slope is about 2.7 times as steep as it should be (maybe because the range of k is too narrow?)
k_df = data.frame('connectivity'=colSums(S_sft), 'bucket'=cut(colSums(S_sft), 10)) %>% group_by(bucket) %>%
dplyr::summarize(p_k=n(), k=mean(connectivity)) %>% mutate(p_k=p_k/sum(p_k))
lm(log10(p_k) ~ log10(k), data=k_df)
##
## Call:
## lm(formula = log10(p_k) ~ log10(k), data = k_df)
##
## Coefficients:
## (Intercept) log10(k)
## 2.700 -2.686
rm(k_df)
Note: the distributions are very steep; both axes are on a sqrt scale
plot_data = data.frame('n'=1:nrow(S)^2, 'S'=sort(melt(S)$value), 'S_sft'=sort(melt(S_sft)$value))
plot_data %>% filter(n%%100==0) %>% dplyr::select(S, S_sft) %>% melt %>%
ggplot(aes(value, color=variable, fill=variable)) + geom_density(alpha=0.5) +
xlab('k') + ylab('p(k)') + scale_x_sqrt() + scale_y_sqrt() + theme_minimal()
rm(plot_data)
The example given in the article shows a curve similar to this one, and the authors consider it acceptable
k_df = data.frame('connectivity'=colSums(S_sft), 'bucket'=cut(colSums(S_sft), 10)) %>% group_by(bucket) %>%
dplyr::summarize(p_k=n(), k=mean(connectivity)) %>% mutate(p_k=p_k/sum(p_k))
# k_df %>% ggplot(aes(k,p_k)) + geom_point(color='#0099cc') + geom_smooth(method='lm', se=FALSE, color='gray') +
# ylab('p(k)') + scale_x_log10() + scale_y_log10() + theme_minimal()
k_df %>% ggplot(aes(log10(k),log10(p_k))) + geom_point(color='#0099cc') + geom_smooth(method='lm', se=FALSE, color='gray') +
ylab('p(k)') + theme_minimal()
\(R^2=0.83\), consistent with the \(R^2\) we got from the pickSoftThreshold function
lm(log10(p_k) ~ log10(k), data=k_df) %>% summary
##
## Call:
## lm(formula = log10(p_k) ~ log10(k), data = k_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6671 -0.3468 0.1154 0.3334 0.4489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7000 0.6801 3.970 0.004119 **
## log10(k) -2.6857 0.4049 -6.633 0.000164 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4242 on 8 degrees of freedom
## Multiple R-squared: 0.8462, Adjusted R-squared: 0.8269
## F-statistic: 44 on 1 and 8 DF, p-value: 0.0001636
rm(k_df)
Using the topological overlap dissimilarity measure because it has been found to result in biologically meaningful modules
The 1st quartile is already 0.9857: most gene pairs are very dissimilar
TOM = S_sft %>% TOMsimilarity
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
dissTOM = 1-TOM
rownames(dissTOM) = rownames(S_sft)
colnames(dissTOM) = colnames(S_sft)
dissTOM %>% melt %>% summary
## Var1 Var2 value
## ENSG00000000971: 3000 ENSG00000000971: 3000 Min. :0.0000
## ENSG00000001084: 3000 ENSG00000001084: 3000 1st Qu.:0.9857
## ENSG00000001626: 3000 ENSG00000001626: 3000 Median :0.9921
## ENSG00000002586: 3000 ENSG00000002586: 3000 Mean :0.9885
## ENSG00000002746: 3000 ENSG00000002746: 3000 3rd Qu.:0.9958
## ENSG00000002933: 3000 ENSG00000002933: 3000 Max. :0.9999
## (Other) :8982000 (Other) :8982000
Using hierarchical clustering on the TOM-based dissimilarity matrix
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster.
Instead of finding clusters, each new split of this method seems to discard outliers
dend = dissTOM %>% as.dist %>% hclust(method='average')
plot(dend, hang=0, labels=FALSE)
Instead of cutting the dendrogram into clusters at a fixed height, using a dynamic branch cutting approach taken from Dynamic Tree Cut: in-depth description, tests and applications
modules = cutreeDynamicTree(dend, deepSplit=F, minModuleSize=20)
names(modules) = labels(dend)
clusterings[['cor']] = modules
“One can relate modules to each other by correlating the corresponding module eigengenes (Horvath et al., 2005). If two modules are highly correlated, one may want to merge them”
Calculate the “eigengenes” (1st principal component) of each module
module_colors = c('gray', viridis(max(modules)))
MEs_output = datExpr %>% t %>% moduleEigengenes(colors=modules)
MEs = MEs_output$eigengenes
Merge similar modules
cor_dist = 1-cor(MEs)
dend_MEs = cor_dist %>% as.dist %>% hclust(method='average')
dend_MEs %>% as.dendrogram %>% plot(ylim=c(0, 0.8))
abline(h=0.65, col='#0099cc')
abline(h=0.38, col='#009999')
# Two main modules
tree_cut = cutree(dend_MEs, h=0.65)
top_modules = modules %>% replace(modules %in% (gsub('ME', '', names(tree_cut[tree_cut==1])) %>% as.numeric), 1) %>%
replace(modules %in% (gsub('ME', '', names(tree_cut[tree_cut==2])) %>% as.numeric), 2) %>%
replace(modules == 0, 0)
clusterings[['cor_top_clusters']] = top_modules
# Closest modules
tree_cut = cutree(dend_MEs, h=0.38)
merged_modules = modules
n=0
for(i in sort(unique(tree_cut))){
n=n+1
merged_modules = merged_modules %>%
replace(modules %in% (gsub('ME', '', names(tree_cut[tree_cut==i])) %>% as.numeric), n)
}
merged_modules = merged_modules %>% replace(modules == 0, 0)
clusterings[['cor_merged']] = merged_modules
merged_module_colors = c('gray', viridis(length(unique(merged_modules))))
top_module_colors = c('gray', viridis(max(top_modules)))
dend_colors = data.frame('ID'=names(modules[labels(dend)]),
'OriginalModules' = module_colors[modules[dend$order]+1],
'MergedModules' = merged_module_colors[merged_modules[dend$order]+1],
'TopModules' = top_module_colors[top_modules[dend$order]+1])
dend %>% as.dendrogram(hang=0) %>% set('labels', rep('', nrow(dissTOM))) %>% plot(ylim=c(min(dend$height),1))
colored_bars(colors=dend_colors[,-1])
rm(MEs, dend, modules, module_colors, MEs_output, top_modules, merged_modules, tree_cut,
merged_module_colors, top_module_colors, dend_colors)
dist_mat = datExpr %>% dist
Scaling the Euclidean distance matrix to \(s \in [0,1]\) with min-max normalization: \(s_{ij} = \frac{d_{ij} - \min(d)}{\max(d) - \min(d)}\) (note that with this scaling the most distant gene pairs get the largest \(s\))
S = as.matrix((dist_mat-min(dist_mat))/(max(dist_mat)-min(dist_mat)))
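Min-max scaling in isolation, on toy values (it maps any range linearly onto \([0,1]\)):

```r
# Min-max normalization: smallest value -> 0, largest value -> 1
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

round(minmax(c(2, 5, 11)), 3)   # 0, 0.333, 1
```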
We follow the scale-free topology criterion because metabolic networks have been found to display approximately scale-free topology.
best_power = datExpr %>% t %>% pickSoftThreshold(powerVector = 1:15, RsquaredCut=0.8)
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.00342 0.261 0.988 775.000 770.0000 1330.00
## 2 2 0.24500 -1.490 0.986 257.000 247.0000 636.00
## 3 3 0.58100 -2.220 0.963 98.200 90.3000 331.00
## 4 4 0.74900 -2.510 0.990 41.600 36.6000 186.00
## 5 5 0.83500 -2.560 0.989 19.200 15.8000 110.00
## 6 6 0.87300 -2.590 0.994 9.460 7.2500 67.80
## 7 7 0.90000 -2.520 0.994 4.960 3.4900 43.50
## 8 8 0.90800 -2.440 0.988 2.740 1.7600 28.80
## 9 9 0.88700 -2.380 0.951 1.590 0.9140 19.70
## 10 10 0.39100 -3.380 0.358 0.960 0.4870 13.70
## 11 11 0.38500 -3.210 0.349 0.603 0.2660 9.78
## 12 12 0.93500 -1.980 0.982 0.391 0.1490 7.11
## 13 13 0.95700 -1.830 0.985 0.262 0.0850 5.25
## 14 14 0.93300 -1.770 0.941 0.180 0.0495 3.98
## 15 15 0.94200 -1.780 0.964 0.126 0.0295 3.36
print(paste0('Best power for scale free topology: ', best_power$powerEstimate))
## [1] "Best power for scale free topology: 5"
Raise the similarity matrix to the suggested power
S_sft = S^best_power$powerEstimate
Additional considerations
With power = 5 we get a mean connectivity of about 17
mean_connectivity = matrix(0, 15, 2) %>% data.frame
colnames(mean_connectivity) = c('power','mean_connectivity')
for(i in 1:15) mean_connectivity[i,] = c(i, mean(colSums(S^i)))
ggplotly(mean_connectivity %>% ggplot(aes(power, mean_connectivity)) + ylab('mean connectivity') +
geom_vline(xintercept=best_power$powerEstimate, color='gray') + geom_point(color='#0099cc') +
scale_y_sqrt() + theme_minimal())
rm(mean_connectivity, best_power)
The slope is about 2.4 times as steep as it should be (maybe because the range of k is too narrow?)
k_df = data.frame('connectivity'=colSums(S_sft), 'bucket'=cut(colSums(S_sft), 10)) %>% group_by(bucket) %>%
dplyr::summarize(p_k=n(), k=mean(connectivity)) %>% mutate(p_k=p_k/sum(p_k))
lm(log10(p_k) ~ log10(k), data=k_df)
##
## Call:
## lm(formula = log10(p_k) ~ log10(k), data = k_df)
##
## Coefficients:
## (Intercept) log10(k)
## 2.683 -2.399
rm(k_df)
Note: the distributions are very steep; both axes are on a sqrt scale
plot_data = data.frame('n'=1:nrow(S)^2, 'S'=sort(melt(S)$value), 'S_sft'=sort(melt(S_sft)$value))
plot_data %>% filter(n%%100==0) %>% dplyr::select(S, S_sft) %>% melt %>%
ggplot(aes(value, color=variable, fill=variable)) + geom_density(alpha=0.5) +
xlab('k') + ylab('p(k)') + scale_x_sqrt() + scale_y_sqrt() + theme_minimal()
rm(plot_data)
This curve is noisier than the first two
k_df = data.frame('connectivity'=colSums(S_sft), 'bucket'=cut(colSums(S_sft), 10)) %>% group_by(bucket) %>%
dplyr::summarize(p_k=n(), k=mean(connectivity)) %>% mutate(p_k=p_k/sum(p_k))
# k_df %>% ggplot(aes(k,p_k)) + geom_point(color='#0099cc') + geom_smooth(method='lm', se=FALSE, color='gray') +
# ylab('p(k)') + scale_x_log10() + scale_y_log10() + theme_minimal()
k_df %>% ggplot(aes(log10(k),log10(p_k))) + geom_point(color='#0099cc') + geom_smooth(method='lm', se=FALSE, color='gray') +
ylab('p(k)') + theme_minimal()
\(R^2=0.93\), higher than the \(R^2=0.83\) we got from the pickSoftThreshold function
lm(log10(p_k) ~ log10(k), data=k_df) %>% summary
##
## Call:
## lm(formula = log10(p_k) ~ log10(k), data = k_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60139 -0.06648 -0.02686 0.14496 0.49930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6830 0.5043 5.32 0.0011 **
## log10(k) -2.3986 0.2255 -10.63 1.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3218 on 7 degrees of freedom
## Multiple R-squared: 0.9417, Adjusted R-squared: 0.9334
## F-statistic: 113.1 on 1 and 7 DF, p-value: 1.424e-05
rm(k_df)
Using the topological overlap dissimilarity measure because it has been found to result in biologically meaningful modules
The 1st quartile is already 0.9608: most gene pairs are very dissimilar
TOM = S_sft %>% TOMsimilarity
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
dissTOM = 1-TOM
rownames(dissTOM) = rownames(S_sft)
colnames(dissTOM) = colnames(S_sft)
dissTOM %>% melt %>% summary
## Var1 Var2 value
## ENSG00000000971: 3000 ENSG00000000971: 3000 Min. :0.0000
## ENSG00000001084: 3000 ENSG00000001084: 3000 1st Qu.:0.9608
## ENSG00000001626: 3000 ENSG00000001626: 3000 Median :0.9865
## ENSG00000002586: 3000 ENSG00000002586: 3000 Mean :0.9697
## ENSG00000002746: 3000 ENSG00000002746: 3000 3rd Qu.:0.9943
## ENSG00000002933: 3000 ENSG00000002933: 3000 Max. :0.9989
## (Other) :8982000 (Other) :8982000
Using hierarchical clustering on the TOM-based dissimilarity matrix
In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster to every point in the other cluster.
It looks like it first separates the genes into two groups and then, instead of finding subclusters, each new split discards outliers
dend = dissTOM %>% as.dist %>% hclust(method='average')
plot(dend, hang=0.01, labels=FALSE)
Instead of cutting the dendrogram into clusters at a fixed height, using a dynamic branch cutting approach taken from Dynamic Tree Cut: in-depth description, tests and applications
modules = cutreeDynamicTree(dend, deepSplit=F, minModuleSize=20)
names(modules) = labels(dend)
clusterings[['euc']] = modules
“One can relate modules to each other by correlating the corresponding module eigengenes (Horvath et al., 2005). If two modules are highly correlated, one may want to merge them”
Calculate the “eigengenes” (1st principal component) of each module
module_colors = c('gray', viridis(max(modules)))
MEs_output = datExpr %>% t %>% moduleEigengenes(colors=modules)
MEs = MEs_output$eigengenes
Merge similar modules
cor_dist = 1-cor(MEs)
dend_MEs = cor_dist %>% as.dist %>% hclust(method='average')
dend_MEs %>% as.dendrogram %>% plot
abline(h=0.5, col='#0099cc')
# Two main modules
tree_cut = cutree(dend_MEs, h=0.5)
top_modules = modules %>% replace(modules %in% (gsub('ME', '', names(tree_cut[tree_cut==1])) %>% as.numeric), 1) %>%
replace(modules %in% (gsub('ME', '', names(tree_cut[tree_cut==2])) %>% as.numeric), 2) %>%
replace(modules == 0, 0)
clusterings[['euc_top_clusters']] = top_modules
top_module_colors = c('gray', viridis(max(top_modules)))
dend_colors = data.frame('ID'=names(modules[labels(dend)]),
'OriginalModules' = module_colors[modules[dend$order]+1],
'TopModules' = top_module_colors[top_modules[dend$order]+1])
dend %>% as.dendrogram(hang=0.1) %>% plot
colored_bars(colors=dend_colors[,-1])
rm(MEs, dend, modules, module_colors, MEs_output, top_modules, tree_cut,
   top_module_colors, dend_colors)
cluster_sim = data.frame(matrix(nrow = length(clusterings), ncol = length(clusterings)))
for(i in 1:(length(clusterings))){
cluster1 = na_if(clusterings[[i]], 0) %>% as.factor  # na_if keeps multi-digit labels intact (sub(0, NA, .) would turn 10 into '1NA')
for(j in (i):length(clusterings)){
cluster2 = na_if(clusterings[[j]], 0) %>% as.factor
cluster_sim[i,j] = adj.rand.index(cluster1, cluster2)
}
}
colnames(cluster_sim) = names(clusterings)
rownames(cluster_sim) = colnames(cluster_sim)
cluster_sim = cluster_sim %>% as.matrix %>% round(2)
heatmap.2(x = cluster_sim, Rowv = F, Colv = F, dendrogram = 'none', col=rev(brewer.pal(9,'YlOrRd'))[5:9],
cellnote = cluster_sim, notecol = 'black', trace = 'none', key = FALSE,
cexRow = 1, cexCol = 1, margins = c(12,12))
rm(i, j, cluster1, cluster2, cluster_sim)
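The adjusted Rand index used above (pdfCluster::adj.rand.index) corrects the plain Rand index for chance agreement, so unrelated partitions score near 0 and identical ones score 1. A minimal base-R version on toy labels illustrates the idea:

```r
# Adjusted Rand index between two partitions (base-R sketch)
ari <- function(x, y) {
  tab <- table(x, y)
  a <- sum(choose(tab, 2))               # pairs grouped together in both
  b <- sum(choose(rowSums(tab), 2))      # pairs together in x
  c <- sum(choose(colSums(tab), 2))      # pairs together in y
  expected <- b * c / choose(sum(tab), 2)
  (a - expected) / ((b + c) / 2 - expected)
}

ari(c(1, 1, 2, 2), c(1, 1, 2, 2))   # identical partitions: 1
ari(c(1, 1, 2, 2), c(1, 2, 1, 2))   # crossed partitions: -0.5 (worse than chance)
```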
Clusters don't follow any strong patterns, at least in the first principal components
pca = datExpr %>% prcomp
plot_data = data.frame('ID' = rownames(datExpr), 'PC1' = pca$x[,1], 'PC2' = pca$x[,2],
'PC3' = pca$x[,3], 'PC4' = pca$x[,4], 'PC5' = pca$x[,5],
'bicor' = na_if(clusterings[['bicor']], 0) %>% as.factor,  # na_if: sub(0, NA, .) would mangle labels like 10
'bicor_top_clusters' = na_if(clusterings[['bicor_top_clusters']], 0) %>% as.factor,
'bicor_merged' = na_if(clusterings[['bicor_merged']], 0) %>% as.factor,
'cor' = na_if(clusterings[['cor']], 0) %>% as.factor,
'cor_top_clusters' = na_if(clusterings[['cor_top_clusters']], 0) %>% as.factor,
'cor_merged' = na_if(clusterings[['cor_merged']], 0) %>% as.factor,
'euc' = na_if(clusterings[['euc']], 0) %>% as.factor,
'euc_top_clusters' = na_if(clusterings[['euc_top_clusters']], 0) %>% as.factor)#,
#'euc_merged' = na_if(clusterings[['euc_merged']], 0) %>% as.factor)
selectable_scatter_plot(plot_data[,-1], plot_data[,-1])
rm(pca, plot_data)
Using Euclidean distance creates only three clusters, defined by their mean expression. It doesn't seem to be a good metric because it doesn't capture any more complex or detailed behaviours in the data
Correlation-based distance metrics seem to capture a type of structure in the data not strongly related to the first principal components, although genes with high values for the 2nd PC seem to cluster together consistently (this is easiest to see on the top clusters)
Comparing the top two modules of the correlation-based methods, it's not easy to decide which method is better
I think biweight midcorrelation may be the best choice because it's more robust than Pearson correlation
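To illustrate why robustness matters: a single outlying sample can wreck a Pearson correlation, while a rank-based measure barely moves. Spearman is used here only as a stand-in robust measure, since bicor itself lives in the WGCNA package; the simulated "genes" are hypothetical:

```r
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.3)   # two strongly co-expressed 'genes'

x[1] <- 20                     # corrupt a single sample with an outlier

cor(x, y)                      # Pearson is dragged down by the one outlier
cor(x, y, method = 'spearman') # rank-based correlation stays high
```

Biweight midcorrelation achieves a similar effect by down-weighting observations far from the median rather than by ranking.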
clusterings_file = './../Data/Gandal/clusterings.csv'
if(file.exists(clusterings_file)){
df = read.csv(clusterings_file, row.names=1)
if(!all(rownames(df) == rownames(datExpr))) stop('Gene ordering does not match the one in clusterings.csv!')
for(clustering in names(clusterings)){
df = df %>% mutate(!!clustering := na_if(clusterings[[clustering]], 0))
rownames(df) = rownames(datExpr)
}
} else {
df = clusterings %>% unlist %>% matrix(nrow=length(clusterings), byrow=T) %>% t %>% data.frame %>% na_if(0)
colnames(df) = names(clusterings)
rownames(df) = rownames(datExpr)
}
write.csv(df, file=clusterings_file)
rm(clusterings_file, df, clustering)
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: Scientific Linux 7.6 (Nitrogen)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] pdfCluster_1.0-3 doParallel_1.0.14
## [3] iterators_1.0.9 foreach_1.4.4
## [5] WGCNA_1.66 fastcluster_1.1.25
## [7] dynamicTreeCut_1.63-1 sva_3.30.1
## [9] genefilter_1.64.0 mgcv_1.8-26
## [11] nlme_3.1-137 DESeq2_1.22.2
## [13] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
## [15] BiocParallel_1.16.6 matrixStats_0.54.0
## [17] Biobase_2.42.0 GenomicRanges_1.34.0
## [19] GenomeInfoDb_1.18.2 IRanges_2.16.0
## [21] S4Vectors_0.20.1 BiocGenerics_0.28.0
## [23] biomaRt_2.38.0 gplots_3.0.1
## [25] dendextend_1.10.0 gridExtra_2.3
## [27] viridis_0.5.1 viridisLite_0.3.0
## [29] RColorBrewer_1.1-2 plotlyutils_0.0.0.9000
## [31] plotly_4.8.0 glue_1.3.1
## [33] reshape2_1.4.3 forcats_0.3.0
## [35] stringr_1.4.0 dplyr_0.8.0.1
## [37] purrr_0.3.1 readr_1.3.1
## [39] tidyr_0.8.3 tibble_2.1.1
## [41] ggplot2_3.1.0 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] readxl_1.1.0 backports_1.1.2 Hmisc_4.1-1
## [4] plyr_1.8.4 lazyeval_0.2.2 splines_3.5.2
## [7] crosstalk_1.0.0 robust_0.4-18 digest_0.6.18
## [10] htmltools_0.3.6 GO.db_3.7.0 gdata_2.18.0
## [13] magrittr_1.5 checkmate_1.8.5 memoise_1.1.0
## [16] fit.models_0.5-14 cluster_2.0.7-1 limma_3.38.3
## [19] annotate_1.60.1 modelr_0.1.4 prettyunits_1.0.2
## [22] colorspace_1.4-1 rrcov_1.4-3 blob_1.1.1
## [25] rvest_0.3.2 haven_1.1.1 xfun_0.5
## [28] crayon_1.3.4 RCurl_1.95-4.10 jsonlite_1.6
## [31] impute_1.56.0 survival_2.43-3 gtable_0.2.0
## [34] zlibbioc_1.28.0 XVector_0.22.0 kernlab_0.9-27
## [37] prabclus_2.2-7 DEoptimR_1.0-8 abind_1.4-5
## [40] scales_1.0.0 mvtnorm_1.0-7 DBI_1.0.0
## [43] Rcpp_1.0.1 xtable_1.8-3 progress_1.2.0
## [46] htmlTable_1.11.2 magic_1.5-9 foreign_0.8-71
## [49] bit_1.1-14 mclust_5.4 preprocessCore_1.44.0
## [52] Formula_1.2-3 htmlwidgets_1.3 httr_1.4.0
## [55] fpc_2.1-11.1 acepack_1.4.1 modeltools_0.2-22
## [58] pkgconfig_2.0.2 XML_3.98-1.11 flexmix_2.3-15
## [61] nnet_7.3-12 locfit_1.5-9.1 later_0.8.0
## [64] labeling_0.3 tidyselect_0.2.5 rlang_0.3.2
## [67] AnnotationDbi_1.44.0 munsell_0.5.0 cellranger_1.1.0
## [70] tools_3.5.2 cli_1.1.0 generics_0.0.2
## [73] RSQLite_2.1.1 broom_0.5.1 geometry_0.4.0
## [76] evaluate_0.13 yaml_2.2.0 knitr_1.22
## [79] bit64_0.9-7 robustbase_0.93-0 caTools_1.17.1
## [82] mime_0.6 whisker_0.3-2 xml2_1.2.0
## [85] compiler_3.5.2 rstudioapi_0.7 geneplotter_1.60.0
## [88] pcaPP_1.9-73 stringi_1.4.3 lattice_0.20-38
## [91] trimcluster_0.1-2.1 Matrix_1.2-15 pillar_1.3.1
## [94] data.table_1.12.0 bitops_1.0-6 httpuv_1.5.0
## [97] R6_2.4.0 latticeExtra_0.6-28 promises_1.0.1
## [100] KernSmooth_2.23-15 codetools_0.2-15 MASS_7.3-51.1
## [103] gtools_3.5.0 assertthat_0.2.1 withr_2.1.2
## [106] GenomeInfoDbData_1.2.0 diptest_0.75-7 hms_0.4.2
## [109] grid_3.5.2 rpart_4.1-13 class_7.3-14
## [112] rmarkdown_1.12 Cairo_1.5-9 shiny_1.2.0
## [115] lubridate_1.7.4 base64enc_0.1-3